DOCUMENT RESUME

ED 270 458                                        TM 860 307

AUTHOR        Cason, Carolyn L.; And Others
TITLE         Reviewer Standards in Division I Program Selection.
PUB DATE      Apr 86
NOTE          28p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (70th, San
              Francisco, CA, April 16-20, 1986).
PUB TYPE      Speeches/Conference Papers (150); Reports -
              Research/Technical (143)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   *Conference Papers; Data Analysis; Educational
              Research; *Evaluation Criteria; *Examiners;
              *Interrater Reliability; Measurement Techniques;
              Models; *Quality Control; *Rating Scales;
              Reliability; Research Reports; Standards; Test
              Validity
IDENTIFIERS   *Reviewers



ABSTRACT 

Cason and Cason's model of performance rating was
used to determine the extent to which variation in reviewer standards
affected the reliability and validity of the program review process 
used to select papers for inclusion in the annual program. Data 
analyzed were the overall recommendation for acceptance and ratings 
on seven quality criteria from each reviewer on each paper proposal 
in 1983, 1985, and 1986. The Casons' model fit each year's data. 
Significant rater stringency variance was found for each of the three 
years. Rater stringency persisted up to three years providing strong 
construct validation for the model. Removing the rater stringency 
effect improved reliabilities from .768, .722, and .739 to .813, .790 
and .790. Construct validities also improved. Had adjusted ratings 
been used in 1986, up to 6 of the 35 papers accepted would have been 
rejected. There were no significant differences in mean rater 
standards year to year; however, mean paper proposal quality was 
sharply lower in 1985. In all years, mean paper quality of accepted 
proposals was significantly better than that of rejected proposals. 
Access to adjusted ratings at the time of the selection decision 
would ease the committee's task and probably improve the quality of
its decisions. (Author/PN) 



***********************************************************************
* Reproductions supplied by EDRS are the best that can be made        *
* from the original document.                                         *
***********************************************************************






Reviewer Standards in Division I Program Selection*

Carolyn L. Cason

Gerald J. Cason

University of Arkansas for Medical Sciences

and

Frank T. Stritter

University of North Carolina Chapel Hill



Address correspondence to: 



Carolyn L. Cason 

UAM3-OON-529 

4301 West Markham 

Little Rock, Arkansas 72205 

(501) 661-5163 






*Presented at the annual meeting of the American Educational Research
Association, San Francisco, April 1986.

We wish to thank the chairs and members of the 1983, 1985, and 1986 Program
Committees for making the data available. Providing this kind of data for external
analysis requires a rare and courageous commitment to scholarly and research canons.

Dick Calkins of Technology, Inc., Houston, Texas provided useful statistical
assistance.

The staff of the Academic Support Center, UAMS provided special programming and 
data processing support. 



Cason, Cason, & Stritter 



Page 2 



SHORT ABSTRACT 

Cason and Cason's model of performance rating was used to determine the extent
to which variation in reviewer standards affected the reliability and validity
of the program review process used to select papers for inclusion in the annual
program. Data analyzed were the overall recommendation for acceptance and
ratings on seven quality criteria from each reviewer on each paper proposal in
1983 (NR=87, NP=120), 1985 (NR=86, NP=100), and 1986 (NR=82, NP=115). The
Casons' model fit each year's data (R > .756; p < .00001). Significant rater
stringency variance was found for each of the three years. Rater stringency 
persisted up to three years providing strong construct validation for the 
model. Removing the rater stringency effect improved reliabilities from .768, 
.722, and .739 to .813, .790 and .790. Construct validities also improved. 
Had adjusted ratings been used in 1986, up to 6 of the 35 papers accepted 
would have been rejected. There were no significant differences in mean 
rater standards year to year; however, mean paper proposal quality was sharply 
lower in 1985. In all years, mean paper quality of accepted proposals was 
significantly better than that of rejected proposals. Access to adjusted 
ratings at the time of the selection decision would ease the committee's task 
and probably improve the quality of its decisions. 




Reviewer Standards in Division I Program Selection 

Carolyn L. Cason 
Gerald J. Cason 
University of Arkansas for Medical Sciences 

and 

Frank T. Stritter 
University of North Carolina Chapel Hill 

Paper proposals selected for presentation at the Division I program of the
annual AERA meeting are intended to address research issues of interest to the
division membership and reflect careful and sound application of scientific method.
These presentations communicate new scientific knowledge while at the same time
providing a mechanism to formally acknowledge those who made the contribution. Thus
what is selected for presentation becomes a matter of importance to both the body of
scientific information and the individual researcher's professional career.

In Division I the Program Committee decides which paper proposals will be
accepted for presentation. In general terms, the objective of the Program Committee
is to accept the best proposed papers from among those meeting at least minimal
standards of scholarship. The committee is aided in its decision making by multiple
reviews of each paper proposal completed by Division I members who have volunteered
to do so. To the extent that reviewers use the same or similar standards in making
their reviews, such reviews aid the Program Committee in selecting quality paper
proposals. However, the assumption of similar standards is suspect as reviewers have
different backgrounds, different experiences, and different levels and areas of
expertise in research, although use of multiple reviews of a single paper proposal
may attenuate the effects of variation in standards among reviewers (Ebel, 1951;
Stanley, 1961). The impact of such variation among reviewers on paper proposal
selection has been largely unevaluated.

Even though peer review is used as the basis for making many highly important
decisions about scientific products (e.g., in promotion, funding, publishing), it has
received only very limited attention as a topic of research. Marsh and Ball's (1981)
study of manuscripts submitted to the Journal of Educational Psychology is the only
one which has examined reviewers' ratings for variation in reviewer standards and the
impact of corrections for reviewer bias on reviewer agreement about manuscript quality.
Marsh and Ball's study is particularly relevant to the issue of variation of rater
standards and their impact in the Division I review process because their study deals
with a similar substantive research area, of equal breadth and complexity, where the
same types and varieties of research methods are used, and a similar peer review
process exists. They found significant systematic variation in mean ratings of
different reviewers, with a single reviewer reliability of .34, somewhat above the
middle of those reported in the psychological and sociological literature, which
ranged from .08 to .54. Corrections for response bias (variation in reviewer
standards) did not yield statistically significant improvements in single reviewer
reliability (.35). Although Marsh and Ball concluded that the observed variation in
ratings was primarily a function of manuscript quality rather than reviewer bias,
their results suggest that correction for reviewer bias was inadequate: manuscript
quality may have been confounded with reviewer bias. Had Marsh and Ball used a more
powerful method to test and correct for the presence of reviewer bias, they might
have found both a significant reviewer effect and a significant improvement in
inter-reviewer agreement arising from correction for reviewer bias. A major obstacle
which they confronted was having very little data for the analysis: only two




[remainder of this paragraph is illegible in the source scan]

Theory 

[The opening of this paragraph is illegible in the source scan. The surviving
fragments cite earlier studies, including Cason et al. (1983), whose findings concern
cases in which the] performance to be evaluated is complex. Cason and Cason's (1984)
model provides the basis for answering the questions (a) are there differences in
standards used by different raters, and (b) do such differences (if they exist)
[remainder of the question illegible]?

As illustrated in Figure 1, Cason and Cason's theory posits that "the
rating received by a subject is a function of that subject's true ability and the
rater's characteristics including the rater's resolving [power], sensitivity,
stringency, and effective rating floor and ceiling" (Cason & Cason, 1984).
The Casons' simplified model of their performance rating theory accounts for all
[systematic] variation in performance ratings exclusively by variation in rater
stringency and subject ability, as illustrated in Figure 2. In the Casons' model the
expected subject rating (ESR), as a percent of the maximum rating, is a function of
the difference (z) between the rater's stringency (i.e., value associated
with the Rater Reference Point or RRP) and the subject's ability (i.e., value
associated with the Subject Ability Point or SAP).

In previous research this relationship was modified by an arbitrary scaling factor
[illegible]. The theoretically postulated curvilinear relationship
between z and the expected subject rating (ESR) has been stipulated as the
unit-normal ogive. Thus, the ESR (in percent) for a given z is equal to the
[proportion of the unit-normal distribution falling below] z; that is, [illegible]
p(z) x 100. This is a deterministic, not probabilistic relationship.

Data Source and Method 

The ratings given by individual reviewers to paper proposals submitted to the
AERA Division I Program Committees for the 1983, 1985, and 1986 programs constitute
the data. The ratings used were the overall recommendation for acceptance. (The
[original scale was 1=accept,] 2=accept with reservations, 3=accept only if
[room permits], and 4=reject. For the purposes of this analysis the scale was reversed
[and converted to percent:] 100% = [accept,] 67% = accept with reservations,
[33% = accept only if room permits,] 0% = reject.) [Analyses] were completed on
reviewer acceptability ratings of 120 (1983), 100 (1985), and 115 (1986) paper
proposals. Although each and every proposal was sent for review to 4 raters, [text
partly illegible in the source scan] no rating was noted on the work sheets provided
[illegible] on some proposal-reviewer combinations.

In 1983, Program Committee members each reviewed from 19 to 23 proposals; other
reviewers from 2 to 5. In 1985, Program Committee members each reviewed from 10 to






18; others reviewed from 1 to 9 (median = 3) . In 1986, Program Committee members 
each reviewed from 13 to 17 proposals, other reviewers from 1 to 9 (median = 4). 
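
The reversal of the four-point recommendation scale into percent scores described in this section can be sketched as follows. This is only an illustrative sketch: the function name is hypothetical, and the assumption that the four categories are equally spaced (so that the intermediate values round to 67% and 33%) is inferred from the percent anchors in the text, not stated as a formula by the authors.

```python
def recommendation_to_percent(rating: int, n_points: int = 4) -> int:
    """Reverse a 1..n_points recommendation (1 = accept, n_points = reject)
    and express it as a percent of the maximum rating.

    Assumes equally spaced categories (an assumption of this sketch).
    """
    if not 1 <= rating <= n_points:
        raise ValueError(f"rating must be between 1 and {n_points}")
    return round(100 * (n_points - rating) / (n_points - 1))


if __name__ == "__main__":
    for rating in (1, 2, 3, 4):
        print(rating, recommendation_to_percent(rating))
```

Under this assumption a recommendation of 1 maps to 100%, 2 to 67%, 3 to 33%, and 4 to 0%.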

On each proposal reviewed for the 1986 program, reviewer evaluations of proposal
quality on each of the seven criteria, as well as their recommendation on disposition
(acceptability rating), were analyzed. To conduct item level quantitative analyses,
the extreme left end of the scale (see Figure 4 for an illustration of the reviewer
inventory) was assigned a value of 1 and the rightmost end a value of 5. The "+"
marks on the scale were associated with the consecutive numbers 2, 3, and 4. For the
criterion "Clarity of Summary," obscure=1 and clear=5. In those instances in which the
reviewer's mark fell between "+"s, the numeric value assigned was that which
represented the "+" closest to the reviewer's mark, in the judgement of the
researcher recoding the data onto machine scannable answer sheets. See Figure 5 for
an example of the machine scannable answer sheets. With the recoding, the
behaviorally anchored scale was treated as a Likert one to facilitate the analysis
and reporting process. These item level data were processed through the UAMS Objective
Test Scoring and Performance Rating (OTS-PR) system. A full set of standard reports
was obtained including those providing the rated quality of the proposals and the
quality of the rating inventory as reflected in intra-class correlation measures of
inter-rater consistency.

Using regression analyses (Ward & Jennings, 1973) which were based on an
improved version of the procedures described in technical detail in Cason and Cason
(1985), RRPs and SAPs were estimated using ratings given by individual reviewers to
paper proposals. The data from each year were analyzed separately. The improved
method used here differed from that reported in 1985 in the following ways.
Estimation of parameters for a given data set involved two successive regression
analyses upon the same data. In each, the model followed the same general form given
in 1985. However, in the first the criterion contained percent scores. In the
second, the criterion contained the z transforms of the expected values from the
first regression analysis. This approach provided better approximations of least
squares solutions for the theoretical model.

Results 

Descriptive statistics on observed ratings are given in Table 1. On the face of 
it, the observed ratings are so consistent with respect to mean and standard 
deviation it would seem tempting to assume that both average proposal quality and 
average rater standards remained consistent between 1983 and 1986. This turns out 
not to be the case. 

Table 1. Observed Acceptability Ratings of 1983, 1985, and 1986 Proposals

                        1983      1985      1986

Mean                    43.24     50.10     48.73

Standard Deviation      27.43     27.56     27.47



As can be seen in Table 2, Cason and Cason's model obtained a non-chance fit to
the 1983, 1985, and 1986 paper proposal reviewer acceptability ratings. While there
was no global, significant rater "main effect" in 1983 (F=0.79; df=86,275; p=0.89),
there were significant rater effects in 1985 (F=1.48; df=85,210; p=0.013) and 1986
(F=1.42; df=81,258; p=0.021). For details on the way in which these effects were
tested, see the description of the statistical models provided in Cason and Cason






(1984). Even though a significant rater stringency effect was not observed in the
1983 data there could still be variation in standards used by individual reviewers.
The absence of a significant rater effect indicated that the mean stringency of the
group of reviewers who rated each proposal was statistically equal to (i.e., not
different from) the mean across all reviewers. Statistically significant overall fit
of the model is prerequisite to establishment of the presence and importance of both
of the formal model constructs: stringency and proposal quality. The significant
rater effects in 1985 and 1986, considered in conjunction with proportions of
variance accounted for by rater stringency and proposal quality, clearly validate
both constructs in these data.

Table 2. Fit of Cason and Cason's Model to 1983, 1985, and 1986
AERA Division I Program Review Data

                                        Year
                              1983      1985      1986

Multiple R (a)                .759      .786      .776
Components of Variance
  Reviewers                   .117      .189 (c)  .144 (c)
  Proposal Quality (b)        .459      .393      .415
  Error (1-R^2)               .424      .418      .441

Number of Reviewers           87        86        82
Number of Proposals           120       100       115
Number of Observations        480       394       453

(a) All Rs are significant at p < .00001.
(b) Same as r_ic or single rater r.
(c) Significant rater effect; p < .025.

Table 2 also shows the relative contribution of reviewer standards (stringency),
proposal quality, and random error in the reviewers' acceptability ratings in each of
the 3 years. Components of variance in Table 2 were estimated as a sum of the
products of the respective standardized weights ($\beta_i$) and correlations ($r_{iy}$)
between predictor variables and the criterion in the regression analysis:

(Equation 1)

\[ \text{Proportion of Variance} = \sum_{i} \beta_{i} \, r_{iy} \]

where i = 1 to n proposals; or, 1 to k reviewers.

The summation of products is across the set of either reviewers or paper
proposals (Hays, 1963). The proportion of variance contributed by variation in
reviewer standards/stringency ranged from 12 to 20%. This is an important, although
modest, amount of total variance to be removed from the error term where it would be
placed in an analysis making no provision for variance in rater standards as an
explicit measurement design variable.

The relatively modest amount of total variance attributable to rater stringency 
is very misleading with respect to magnitude of the impact of an individual rater's 
standards upon the rating given to individual proposals of different levels of 
quality. Table 3 illustrates this point using the 1986 data. The table contains the
expected rating for combinations of high, average, and low quality proposals (high 






and low being defined as ± 1 S.D. from the mean SAP) and high, average, and low
stringency raters (where high and low are defined as ± 1 S.D. from the mean RRP). As
the distributions of SAPs and RRPs in these data are approximately normal, these
stipulative definitions of high and low avoid potentially misleading extreme outlier
cases. For example, it can be seen from the values in Table 3 that a paper proposal
of mean intrinsic quality could have received a rating either near outright rejection
or acceptance depending on whether a rater of high or low stringency had reviewed it.

Table 3. Rating (in %) Expected from Raters with
High (+1SD), Mean, and Low (-1SD) Stringencies

                                          Stringency
                            High (+1SD)     Mean        Low (-1SD)
Proposal Quality            RRP=593         RRP=496     RRP=399

High (+1SD)   SAP=606       56              86          98
Mean          SAP=487       15              46          81
Low (-1SD)    SAP=368       1               10          38
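
The entries in Table 3 can be approximately reproduced from the unit-normal ogive relationship described in the Theory section. The sketch below is illustrative only: the function name is hypothetical, the sign convention z = (SAP - RRP) / scale is an assumption, and the scale value of 100 points per z unit is inferred from the Table 3 entries themselves, not stated in the text.

```python
import math

SCALE = 100.0  # assumed number of SAP/RRP points per unit of z (inferred from Table 3)


def expected_rating(sap: float, rrp: float, scale: float = SCALE) -> float:
    """Expected subject rating (percent of maximum) under a unit-normal ogive.

    z is taken as (SAP - RRP) / scale; the rating is 100 times the
    proportion of the unit-normal distribution falling below z.
    """
    z = (sap - rrp) / scale
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    rrps = {"High": 593, "Mean": 496, "Low": 399}   # rater stringency (Table 3)
    saps = {"High": 606, "Mean": 487, "Low": 368}   # proposal quality (Table 3)
    for q_label, sap in saps.items():
        row = [f"{expected_rating(sap, rrp):5.1f}" for rrp in rrps.values()]
        print(f"{q_label:5s} SAP={sap}: " + "  ".join(row))
```

Under these assumptions each computed value falls within about one percentage point of the corresponding Table 3 entry; the small differences are presumably due to rounding of the published SAP and RRP values.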



According to Hays (1963, p. 424), the intra-class correlation ($r_{ic}$) is a
measure of the variance attributable to an effect ($\sigma_a^2$) as a proportion of
total variance:

(Equation 2)

\[ r_{ic} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2} \]

The proportion of variance attributable to proposal quality reported in Table 2 can
thus be interpreted as the intra-class correlation of reviewers with respect to their
observed acceptability ratings of the proposals. As Hays points out, this is
equivalent to the reliability of a single reviewer's observed acceptability rating.
Alternatively, this value may be interpreted as the expected correlation between the
ratings given by randomly chosen pairs of reviewers. The reliability of a mean of
several reviewers' ratings, as is available in these data (where number of reviewers
= k), is given by the Spearman-Brown expansion formula:

(Equation 3)

\[ r_{kk} = \frac{k \, r}{1 + (k - 1) \, r} \]

where r = the reliability of a unit length measure, in this case a single
          reviewer; and,
      k = number of reviewers.
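
As a worked illustration of Equation 3, the following sketch applies the Spearman-Brown expansion to the single-rater reliabilities in Table 2 (the Proposal Quality row) with k = 4 reviewers. The function name is hypothetical.

```python
def spearman_brown(r: float, k: int) -> float:
    """Reliability of the mean of k raters, given single-rater reliability r."""
    return (k * r) / (1 + (k - 1) * r)


if __name__ == "__main__":
    single_rater = {"1983": 0.459, "1985": 0.393, "1986": 0.415}
    for year, r in single_rater.items():
        print(year, round(spearman_brown(r, k=4), 3))
```

The results are close to the observed aggregate reliabilities reported in the abstract (.768, .722, and .739); the small discrepancies presumably reflect rounding of the published single-rater values.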

Table 4 shows the impact of adjusting acceptability ratings on the reliability
of both single reviewer and aggregate ratings obtained from 4 reviewers. The values
for the single reviewer adjusted ratings were obtained by including only the sum of
error and proposal variances in the denominator of Equation 2. The unadjusted
(observed) acceptability ratings must include the variance associated with reviewers
in addition to that associated with proposals and error (Ebel, 1951). Thus, so long






as variance attributable to reviewers is greater than zero (regardless of the
presence of a significant reviewer effect), adjusted acceptability ratings must have
higher reliabilities than unadjusted ones. Therefore, as can be seen in Table 4, the
reliability of adjusted ratings for each Division I year analyzed is greater than the
reliability based upon the observed ratings. For purposes of comparison, the
reliabilities for observed and adjusted ratings of a single rater and an aggregate of
four raters in Marsh and Ball's study are given. The value obtained by Marsh and Ball in
each of these cases is systematically lower than the lowest comparable value obtained
by our analysis of Division I review data.
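
The adjustment described above, dropping the reviewer component from the denominator of Equation 2, can be sketched from the variance components in Table 2. The function name is hypothetical, and treating the Table 2 proportions directly as the variance terms of Equation 2 is an assumption of this sketch.

```python
def adjusted_single_rater_reliability(proposal_var: float, error_var: float) -> float:
    """Single-rater intra-class correlation after removing reviewer variance.

    Only proposal and error variance remain in the denominator (Equation 2).
    """
    return proposal_var / (proposal_var + error_var)


if __name__ == "__main__":
    # (proposal quality, error) proportions of variance from Table 2
    components = {"1983": (0.459, 0.424), "1985": (0.393, 0.418), "1986": (0.415, 0.441)}
    for year, (proposal, error) in components.items():
        print(year, round(adjusted_single_rater_reliability(proposal, error), 3))
```

Under this reading the adjusted single-rater values come out at about .520, .485, and .485 for 1983, 1985, and 1986.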

Table 4. Reliability of Ratings
Intra-class Correlations

                      Single Rater (k=1)        Aggregate of Raters (k=4)
                      Observed    Adjusted      Observed    Adjusted

Marsh and Ball        .340        .350          .670        .683
Division I
  1983                .459        .520          .768        .813
  1985                .393        .485          .722        .790
  1986                .415        .485          .739        .790

While the fit of the model to the data reported in Table 2 and the presence of
significant rater effects in the 1985 and 1986 data support the validity of the
model, its constructs, and its appropriateness to the kind of rating data under
consideration here, further, stronger support for the model is available in the
results reported in Table 5. Table 5 contains the correlations between reviewer
stringencies (RRPs) estimated on reviewers who participated in program review in more
than one year. For those who reviewed in both 1983 and 1985 and those who reviewed
in both 1985 and 1986 there was a low but statistically significant correlation in
their RRPs. The 1983-1986 correlation failed to reach statistical significance.
These data clearly show that stringencies reflect some substantive characteristic of
the reviewer which persists over a period of up to two years. The significant
correlation between 83-85 reviewer stringencies emphasizes that while a significant
rater "main" effect was absent, true differences in reviewer standards were measured.
As there were real differences in reviewer standards in each year, adjustments for
variation in rater standards produced real (i.e., statistically significant)
improvements in reliability.

Table 5. Stability of Reviewer Standards over Time

                      1985        1986

1983      r           .33         .14
          n           41          32
          p           .02         .23

1985      r                       .27
          n                       40
          p                       .05






The results in Table 5 showing the persistence of consistent reviewer standards
over time (as reflected in RRPs) are stronger than and therefore provide greater
support for the theoretical model underlying these analyses than the only previously
published comparable results (Cason & Cason, 1984, Table 3, p. 240). Although
consistency among raters represented by an intra-class correlation (Ebel, 1951;
Stanley, 1961) is frequently interpreted as a measure of reliability, it may also be
interpreted as a measure of validity. Stanley (1961) observed that each rater may be
considered a different method of measuring a given construct (e.g., paper proposal
quality). The appropriateness of this interpretation is supported in the present
context by its equivalent use in the analysis of reviews of manuscripts submitted to
the Journal of Educational Psychology (Marsh & Ball, 1981). This interpretation
seems particularly appropriate with respect to a global measure of proposal quality
(the acceptability rating) in light of the report by Littlefield and Troendle
(1986). Therefore, the single rater reliabilities (intra-class correlations)
reported in Table 4 may be equally well interpreted as both single rater construct
reliability coefficients and single rater validity coefficients. However,
reliability and validity do not expand in the same manner with increased numbers of
independent observations. The increase in reliability is directly proportional to
the number of observations; the increase in validity is approximately proportional to
the square root of the number of observations as shown in Equation 4 (Gulliksen, 

(Equation 4)

\[ r_{ky} = \frac{r_{1y} \, k^{1/2}}{\left[ 1 + (k - 1) \, r_{xx} \right]^{1/2}} \]

where $r_{ky}$ is the validity based on k independent raters;
      $r_{1y}$ is the validity of a single rater;
      $r_{xx}$ is the reliability of a single rater; and,
      k is the number of independent raters/ratings.

Table 6 reports the validity of a single rater and the aggregate of four raters
as measures of global acceptability. As discussed above, the single rater observed 
and adjusted validities are equal to the corresponding single rater observed and 
adjusted reliabilities reported in Table 4. As with reliability, a non-trivial 
improvement in convergent construct validity was obtained by adjusted ratings when 
contrasted with observed ratings. By the same logic as was applied to reliabilities, 
the improvements in validity are real; that is, statistically significant. 

Table 6. Validity of Ratings

              Single Rater (k=1)        Aggregate of Raters (k=4)
              Observed    Adjusted      Observed    Adjusted

1983          .459        .520          .595        .650

1985          .393        .485          .532        .619

1986          .415        .485          .554        .619
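
The aggregate validities in Table 6 can be checked against Equation 4 with a short sketch. The function name is hypothetical, and the sketch follows the text in setting the single-rater validity equal to the single-rater reliability.

```python
import math


def aggregate_validity(single_validity: float, single_reliability: float, k: int) -> float:
    """Validity of the mean of k independent raters (Equation 4)."""
    return single_validity * math.sqrt(k) / math.sqrt(1 + (k - 1) * single_reliability)


if __name__ == "__main__":
    # Single-rater (observed, adjusted) values from Table 6; validity is taken
    # to equal reliability for a single rater, as the text argues.
    single = {"1983": (0.459, 0.520), "1985": (0.393, 0.485), "1986": (0.415, 0.485)}
    for year, (observed, adjusted) in single.items():
        print(year,
              round(aggregate_validity(observed, observed, 4), 3),
              round(aggregate_validity(adjusted, adjusted, 4), 3))
```

With k = 4 the computed values agree with the k=4 columns of Table 6 to within about .001.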






The origin (i.e., the zero point) on the ability and stringency scale is
arbitrary. In the actual estimation of RRPs one rater's RRP is chosen to anchor the
scale and arbitrarily set equal to 500. This process is carried out independently on
each set of data. In the present case, separate analyses were completed on each
year's program proposals.

We made the plausible assumption that mean rater stringency remained constant
for those raters who participated in reviews for both 1983 and 1985. There were 41
raters in common between 1983 and 1985. The mean RRPs of these 41 raters on the
original uncalibrated scales were 515.54 and 557.30 for 1983 and 1985 respectively.
The two scales were calibrated by adjusting all the RRPs such that the 41 common
raters had a mean RRP of 500. The original 1983 RRPs were adjusted by the additive
constant -15.54. The original 1985 RRPs were adjusted by the additive constant
-57.30.



The 40 raters in common between 1985 and 1986 had mean RRPs on the original
scales of 568.48 and 514.76, respectively. When the original values of these 40
raters' RRPs on the 1985 scale were adjusted by -57.30 to fall on the calibrated
scale for 1983-85, their resulting mean on the calibrated scale was 511.18. To
obtain the same mean for these 40 raters' RRPs on the 1986 data required an
adjustment of -3.58 for values on the original, uncalibrated 1986 scale. Within a
given year, SAPs and RRPs are determined on the same scale; therefore, adjustments to
achieve calibration were the same within each year for both RRPs and SAPs. This
process is analogous to the calibration of exam scores when latent-trait models are
used and calibration is achieved through equating item difficulties for linking
items, i.e., sub-sets of exam items in common between exams. However, because item
difficulties have much larger standard errors than do RRPs and SAPs, far less data is
required in the rating case. All further information on and discussion of SAPs and
RRPs is in terms of values on the calibrated scale defined above.
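
The chain of additive constants used to link the three yearly scales can be traced with a few lines of arithmetic. The sketch below uses only the common-rater means quoted in the text; the function and variable names are hypothetical.

```python
def additive_constant(target_mean: float, observed_mean: float) -> float:
    """Shift that moves a group of raters' mean RRP onto the target mean."""
    return target_mean - observed_mean


# Step 1: the 41 raters common to 1983 and 1985 are centered on 500.
shift_1983 = additive_constant(500.0, 515.54)
shift_1985 = additive_constant(500.0, 557.30)

# Step 2: the 40 raters common to 1985 and 1986 are placed on the calibrated
# 1983-85 scale, and the 1986 scale is shifted to give them the same mean.
linked_mean_1985 = 568.48 + shift_1985
shift_1986 = additive_constant(linked_mean_1985, 514.76)

if __name__ == "__main__":
    print(round(shift_1983, 2), round(shift_1985, 2))
    print(round(linked_mean_1985, 2), round(shift_1986, 2))
```

Because SAPs and RRPs share a scale within each year, the same yearly constant would be added to both.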

Table 7 provides summary information on reviewer standards (RRPs) and proposal
quality (SAPs) in calibrated scale values for all three programs. In each year the
mean stringency of program committee members was slightly greater than that of
non-committee member reviewers (although, as indicated by the standard errors, not
significantly so). There was a slight increase in committee member stringency between
1983 and 1985, followed in 1986 by a decline to approximately the 1983 level. These
changes were also not statistically significant. Over the three years, non-committee
members' average stringency fluctuated even less than did that of committee members.
The differences between mean committee members' and mean non-committee members'
stringencies within and across years were also not statistically significant. The
absence of statistically different mean stringencies indicates that the observed
differences could be attributed to chance fluctuations in rater stringencies arising
from random sampling of reviewers from the same hypothetical pool of potential
reviewers. Nevertheless, any difference in rater standards has the potential of
making a practical difference with regard to the evaluation of an individual paper
proposal.

Proposal quality as measured by mean SAPs of proposals accepted fluctuated
significantly between 1983 and 1985 and between 1985 and 1986, first declining then
rising above the 1983 value. In each year the mean quality of the rejected proposals
was significantly below that of the accepted proposals. Across the three years, the
quality of rejected proposals declined substantially from 1983 to 1985 then returned
in 1986 to near the 1983 value. Under the assumption that scale calibration across
years was successful, the lower proposal quality of 1985 cannot be attributed to the
concurrent, slightly higher stringencies of reviewers in that year.






Table 7. Reviewer Standards and Proposal Quality

                                        Year
                              1983      1985      1986

Reviewer Stringency (RRP)
  Committee Members
    N                         6         8         8
    Mean                      499.5     505.7     499.4
    SE                        22.6      11.1      19.8

  Non-Committee Members
    N                         81        76        74
    Mean                      497.5     496.7     492.4
    SE                        7.9       13.1      11.6

Proposal Quality (SAP)
  Accepted
    N                         42        33        35
    Mean                      571.5     528.7     587.4
    SE                        13.8      12.7      24.4

  Rejected
    N                         78        67        80
    Mean                      443.5     377.7     437.4
    SE                        9.8       12.5      11.4
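
The text does not state which test was used to compare the means in Table 7. As a rough plausibility check only, the sketch below assumes independent groups and a normal approximation, dividing each difference in means by the square root of the sum of the squared standard errors. The function name and the 1.96 cutoff are assumptions of this sketch, not the authors' procedure.

```python
import math


def z_difference(mean_a: float, se_a: float, mean_b: float, se_b: float) -> float:
    """Approximate z statistic for the difference of two independent means."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    # Values taken from Table 7.
    comparisons = {
        "1983 accepted vs rejected SAP": (571.5, 13.8, 443.5, 9.8),
        "accepted SAP, 1983 vs 1985": (571.5, 13.8, 528.7, 12.7),
        "accepted SAP, 1986 vs 1985": (587.4, 24.4, 528.7, 12.7),
        "1983 committee vs non-committee RRP": (499.5, 22.6, 497.5, 7.9),
        "committee RRP, 1985 vs 1983": (505.7, 11.1, 499.5, 22.6),
    }
    for label, values in comparisons.items():
        z = z_difference(*values)
        flag = "exceeds 1.96" if abs(z) > 1.96 else "below 1.96"
        print(f"{label}: z = {z:.2f} ({flag})")
```

Under these assumptions the proposal quality contrasts exceed the conventional cutoff while the stringency contrasts do not, which is in line with the pattern of significance described in the text.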

Table 8 reports the correlations between disposition of proposals (i.e.,
acceptance = 1, or rejection = 0 for the program) with the mean observed
acceptability across 4 reviewers (including one program committee member), the
adjusted acceptability rating of each proposal, and the acceptability rating given by
the Program Committee member. The moderate values of the correlations between the
mean observed acceptability ratings and disposition of proposals reflect, in part,
the less than perfect reliability of this measure, which was available at the time
that the disposition decision was made. According to the informal account of Program
Committee members, other factors contributing to the moderate correlation between
mean observed acceptability rating and disposition included: Division I policy to
encourage participation from professions previously under-represented in the program
by de facto application of less stringent standards; committee members giving
differential credibility to selected reviewers; accepting only a single paper from a
given author who proposed two or more closely related papers, each of which received
high acceptability ratings; and rejection of papers the same as or highly similar to
ones presented by the author elsewhere, in spite of high acceptability ratings.

Table 8. Correlations: Disposition with Acceptability Ratings

                                        Year
                              1983      1985      1986
Disposition with              N=120     N=100     N=115

Mean Observed Rating          .64       .53       .01
Adjusted Rating               .60       .47       .61
Program Committee
  Member's Rating             .56       .70       .49

Upper Limit of r              .84       .85       .87






In 1983 and 1986 the acceptability rating given by the Program Committee member
was correlated only slightly less strongly with disposition than the mean
acceptability rating across all reviewers. This suggests that in these cases the
actual disposition was influenced by the reviews of non-committee members. The
reversal of this pattern in 1985 may be an artifact of the process used to make the
disposition decision and to record the Program Committee member's acceptability
rating. According to one Program Committee member, the committee reached consensus
on the disposition decision and the record of an individual Program Committee
member's rating of a particular paper was changed to conform with this consensus.
Presumably the correlation between disposition and Program Committee
members' acceptability rating was only .70 because of a failure to alter the
recorded individual Program Committee member's acceptability rating to conform with
the committee's decision.

The lower correlations of both the mean observed acceptability rating and the 
adjusted rating with disposition found in 1985, concurrent with the much higher 
correlation between the Program Committee member's rating and disposition, are 
consistent with selective attention on the part of committee members to other 
reviewers' acceptability ratings that were more "credible". However, these 
apparently anomalous results are only suggestive of that hypothesis and are open to 
other interpretations. 

Under Cason and Cason's model the best available (i.e., most reliable and valid) 
single measure of proposal acceptability is the adjusted aggregate multi-rater 
acceptability rating. This measure was not available to any of the Program 
Committees at the time that disposition decisions were made. A measure of how well 
the committee managed to extract the best information from the observed ratings 
available to them is the correlation between disposition and adjusted ratings. By 
this interpretation the Program Committees in 1983 and 1986 did the best and about 
equally well. The lower correlation between disposition and adjusted ratings in 1985 
suggests that this committee would be less likely to endorse adjusted ratings as 
'best' measures, in spite of the fact that the Cason and Cason model achieved its best 
results with the 1985 data: there was a significant rater stringency effect and the 
adjusted ratings were more construct valid (i.e., .53 for observed vs. .62 for 
adjusted). 

The correlations between the mean observed acceptability ratings and the 
adjusted acceptability ratings for 1983, 1985, and 1986 were respectively .92, .86, 
and .93. The lower correlation found in the 1985 data reflects the lower proportion 
of variance attributable to proposal quality and the higher proportion of variance 
attributable to reviewer standards in 1985 than in either 1983 or 1986. Similarly, 
1985 had the lowest reliability associated with observed acceptability ratings. Each 
of these findings represents a related but different quantification of the fact that 
in 1985 the Program Committee's task of extracting useful information from the 
observed ratings was more difficult than in the other two years. Furthermore, the 
correlation between the mean observed acceptability rating and disposition reported 
in Table 8 tends to exaggerate the departure between what the committee actually 
chose to accept and acceptance based on a simple selection of the top n papers each 
year as determined strictly by mean observed ratings (where n = number of papers 
accepted within a given year). The maximum value of this correlation is a function 
of the proportion of proposals to be accepted within any given year, and in none of 
these years would it have been equal to 1.00 (McNemar, 1969). For example, in 1986, 
when 35 of the 115 paper proposals could be accepted, the maximum correlation 
obtainable between mean observed ratings and disposition is .87. 
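The ceiling on this correlation can be illustrated with a short sketch: even when the top n proposals are selected strictly by score, the correlation between the 0/1 disposition and the scores stays below 1.00, and its value depends on the proportion accepted and on the shape of the score distribution. The evenly spaced scores and the helper names below are assumptions made only for illustration; because the real rating distribution differs, this sketch is not expected to reproduce the upper limits reported in Table 8.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_n_indicator(scores, n_accept):
    """Return 1 for the n_accept highest scores and 0 for the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:n_accept])
    return [1 if i in chosen else 0 for i in range(len(scores))]

# Hypothetical, evenly spaced scores for 115 proposals; 35 are accepted.
scores = list(range(1, 116))
accepted = top_n_indicator(scores, 35)
ceiling = pearson(accepted, scores)
print(round(ceiling, 3))   # below 1.00 even under strict ranking
```

The comparison of interest is therefore between the observed correlation and this kind of ceiling, not between the observed correlation and 1.00.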






In Table 9, the 1986 data are used to provide a contrast between the results of 
the Program Committee's actual selection policy and what the results would have 
looked like had they (a) chosen the top 35 proposals based on mean observed 
acceptability ratings or (b) chosen the top 35 proposals based on adjusted 
acceptability ratings. For reasons discussed above, the ranges of acceptability 
ratings for accepted and rejected proposals resulting from the Program Committee's 
actual disposition decisions overlap substantially. At least one paper with a high 
rating (83%) was rejected while at least one with a moderately low rating (34%) was 
accepted. The other two decision rules prohibit overlaps of this kind between 
accepted and rejected proposals. Table 9 shows that the second two decision rules 
result in higher mean ratings for accepted and lower mean ratings for rejected 
proposals, with the rule based on adjusted scores giving the greatest differentiation 
between the means of the two groups. The correlation between the disposition of the 
proposals and the mean observed acceptability ratings for the actual program was .61. 
For disposition and strict ranking based on mean observed acceptability the 
correlation was .76. The correlation between the adjusted score and selection based 
strictly on ranking of the adjusted score was .77. 



Table 9. Contrast Between Alternate Selection Policies 

                                  Outcome Based on 
                 Committee    Ranking of Mean    Ranking of 
                 Selection    Observed Scores    Adjusted Scores 

Accept (n=35) 
  Mean              69.9           75.4              79.2 
  SE                 2.6            1.6               1.7 
  Range          33.5-94.2      62.2-94.2         65.6-99.0 

Reject (n=80) 
  Mean              33.6           37.2              35.1 
  SE                 2.2            1.9               2.1 
  Range           6.3-83.3       6.3-61.2          1.6-65.4 



The correlation between the mean observed and the adjusted acceptability ratings 
of the 1986 proposals was .93. Given this and the relatively modest difference in 
the reliability of these two measures (.74 vs. .79), it might appear that little 
difference would arise from choosing to use one or the other. This turns out not to 
be the case. 

Table 10 shows the impact of using either the observed mean or the adjusted 
rating for selecting proposals for inclusion in the program given the simplified 
decision rule of selecting the top 35 proposals. While this table does not show 
exactly what the choice between these two measures of acceptability would have 
produced in the context of the Program Committee's more complicated decision rule, it 
is probably highly suggestive of, and close to, what would have happened. Of the 35 
included in the program by the simplified decision rule based on either of these two 
measures, 6 (17%) included under one measure would be excluded under the other and 
vice versa. A 17% change in the specific proposals included in the program is a 
practically important difference. Even if the difference were only a single 
proposal, decisions based on the adjusted ratings would be superior because they are 
based on both a more reliable and a more construct valid measure (Stanley, 1961; 
Marsh & Ball, 1981; Cason & Cason, 1984). 






Table 10. Transitions in Accept/Reject Outcome Resulting 
from Using Adjusted or Observed Ratings 

Outcome Based on         Outcome Based on Adjusted Ratings 
Observed Ratings         Accept      Reject      Total 

Accept                     29           6          35 
Reject                      6          74          80 
Total                      35          80         115 
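A transition table of this kind can be built by applying the same top-n rule to two score lists and cross-classifying the outcomes. The following is a minimal sketch under that assumption; the function name `transition_table` and the six toy scores are hypothetical and are not the 1986 ratings.

```python
def transition_table(observed, adjusted, n_accept):
    """Cross-classify top-n selections made from two score lists.

    Returns [[accept/accept, accept/reject], [reject/accept, reject/reject]],
    with rows indexed by the observed-score outcome and columns by the
    adjusted-score outcome.
    """
    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:n_accept])

    obs, adj = top(observed), top(adjusted)
    both = len(obs & adj)
    obs_only = len(obs - adj)
    adj_only = len(adj - obs)
    neither = len(observed) - both - obs_only - adj_only
    return [[both, obs_only], [adj_only, neither]]

# Hypothetical scores for six proposals, three to be accepted.
observed = [9, 8, 7, 6, 5, 4]
adjusted = [9, 8, 5, 7, 4, 3]
table = transition_table(observed, adjusted, 3)
print(table)

# Share of the accepted set that changes in Table 10: 6 of 35.
changed_share = 6 / 35
print(round(100 * changed_share))   # about 17 percent
```

Because both rules accept the same number of proposals, the two off-diagonal cells are always equal, as they are in Table 10.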



All of the analyses reported to this point have been based on the acceptability 
rating located at the top of the Division reviewer inventory (see Figure 4). That 
inventory also requests that the reviewer rate the proposal on seven quality 
criteria. Currently the information on the multiple forms completed on an individual 
proposal is not systematically or formally integrated into a composite report for use 
by the Program Committee, nor is it fed back to the proposal authors or proposal 
reviewers. As the information contained in this section of the reviewer's inventory 
is potentially as useful as the global acceptability recommendations, cursory analyses 
of the proposal quality data were completed on the 1986 data (this is the only year 
for which these data were made available to the researchers). 

The first step in the analysis of these data was the transfer of quality ratings 
from the reviewer inventory to OTS-PR machine scannable rating sheets (see Figure 5). 
These rating sheets were then scanned and processed by the UAMS OTS-PR system, which 
generated summary reports including inventory analyses and proposal quality 
summaries. Figure 6, which is photo-reduced output from the UAMS OTS-PR system, 
provides the single rater and k rater mean reliabilities (where k is the geometric 
mean number of raters per paper; Ebel, 1951) for each of the seven quality criteria, 
the mean observed rating, the standard deviation, and the standard error of 
measurement, as well as the same statistics on the average across criteria (i.e., the 
average or overall proposal quality rating). The moderately low reliabilities for the 
average across multiple raters, for each of the criteria and for the average or 
overall quality rating, leave substantial room for improvement in the scale itself 
and the way in which it is used by reviewers. The single rater reliabilities reported 
in Figure 6 are the convergent validity coefficients for each of the quality criteria 
and the summative total across these criteria. These values were computed in the same 
manner (i.e., as single rater reliabilities/intra-class correlations) as that used by 
Marsh and Ball to compute the diagonal elements (convergent validities) in their 
multi-method (rater 1, rater 2) multi-trait (manuscript review subscales) analysis in 
accord with Campbell and Fiske (1959). The note to Table 2 by Marsh and Ball 
explicitly states this equivalence between single rater reliabilities and convergent 
validity coefficients. The validities for Marsh and Ball's subscales ranged from .20 
to .27. As can be seen from Figure 6, the analogous validities for the 1986 quality 
criteria (subscales) ranged from .17 to .24. 

The validity found by Marsh and Ball for the overall recommendation for 
acceptance of a manuscript was .34. For the 1986 Division I data the comparable 
value was .415 (calculated as a single rater reliability, because the single rater 
reliability is equivalent to the two rater intercorrelation given by the intra-class 
correlation). In actual fact both of these numbers underestimate the true validity 
of the data. In Marsh and Ball's analysis two raters' data on each manuscript were 
available. In the 1986 Division I data four independent ratings were available. The 
single rater validities given above must be expanded using Equation 4 to determine 
the validity of the aggregate of multiple independent ratings. With respect to the 
overall recommendation or acceptability rating, the Division I review process is more 
valid than that found in Marsh and Ball's study as a result of both (a) higher 
validity at the single rater level and (b) more independent ratings per 
manuscript/paper proposal. In fact, in each of the three years analyzed the Division 
I review process had greater single rater validity and a greater number of 
independent ratings per manuscript/paper proposal than in the Marsh and Ball study. 
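Equation 4 is not reproduced in this part of the paper. Assuming it is the standard Spearman-Brown step-up formula for the mean of k independent ratings (an assumption, although one that agrees with the single rater and mean rater reliabilities printed in Figure 6), the expansion of the two single rater validities can be sketched as follows. The function name `spearman_brown` is illustrative.

```python
def spearman_brown(single_rater_r, k):
    """Step up a single rater reliability to the mean of k independent raters."""
    return k * single_rater_r / (1 + (k - 1) * single_rater_r)

# Single rater values quoted in the text, with the number of ratings per paper.
division_i_1986 = spearman_brown(0.415, 4)   # four ratings per proposal
marsh_and_ball = spearman_brown(0.34, 2)     # two ratings per manuscript

print(round(division_i_1986, 3), round(marsh_and_ball, 3))
```

Under this assumption the stepped-up Division I value is about .74, in line with the reliability reported for the 1986 mean observed acceptability rating, and the Marsh and Ball value is about .51.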

The correlation between the average quality rating and the mean observed 
acceptability rating was .83. This moderately high correlation indicates that these 
two measures reflect substantially but not exactly the same thing (they share over 
60% of their variance, since .83 squared is approximately .69). This leaves 
unresolved the question of whether one or the other or some explicit combination of 
the two measures would best serve the Program Committee as a summative integration of 
the information provided by the reviewers. Littlefield and Troendle's (1986) results 
suggest including acceptability as a subscale within the list of quality criteria, 
preferably at the top of the list. 

Figure 7 parts A, B, and C provide photo-reduced facsimiles of the individual 
performance reports (IPRs) generated by UAMS OTS-PR on the best, a near average, and 
the weakest proposals, as measured by mean quality rating, submitted to the 1986 
Program Committee. 

The principal use of the OTS-PR system with respect to rating data is processing 
ratings of students' performance in clinical settings as part of degree/credit 
granting courses and clerkships and formal training programs such as residency 
programs. For this reason, labeling of some aspects of the report is at variance 
with the current application: "students" are the subjects of the evaluation, and 
"class" is the collective group of subjects upon whom evaluations were conducted (in 
this case all 1986 proposals). The IPR provides, in both graphic and tabular form, 
information on an individual's performance on each item and the performance of an 
average member of his comparison group. The standard error of measurement is 
provided on each item as well as on the subject's total score. The standard 
deviation of scores in the class is provided on each item and the total. The number 
of raters upon whom a given subject's average on an item was computed is provided in 
the rightmost column of the report. Note that the number of raters varies from 
report to report and item to item in these examples. The number given is the number 
of ratings in the valid range, i.e., 1 to 5, and excludes omissions, NAs, etc. 
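The reports do not state how the standard error of measurement (SEM) is computed. A minimal sketch, assuming the classical formula SEM = SD x sqrt(1 - reliability), is given below; this assumption reproduces the overall values printed in Figure 6 (SD .575, mean rater reliability .56, SEM .38). The function names are hypothetical, and the 1.96 multiplier is the 95% level listed in the note to Figure 6.

```python
import math

def standard_error_of_measurement(std_dev, reliability):
    """Classical test theory SEM: SD times the square root of (1 - reliability)."""
    return std_dev * math.sqrt(1 - reliability)

def score_range(score, sem, multiplier=1.96):
    """Band of score plus and minus multiplier times SEM (1.96 for 95%)."""
    return (score - multiplier * sem, score + multiplier * sem)

# Overall values from Figure 6: SD .575 and mean rater reliability .56.
overall_sem = standard_error_of_measurement(0.575, 0.56)

# 95% band around the best proposal's overall 5-point score of 4.54 (Figure 7A).
low, high = score_range(4.54, overall_sem)
print(round(overall_sem, 2), round(low, 2), round(high, 2))
```

The upper bound of this band exceeds the top of the 5-point scale, so in practice it would be truncated at 5.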

A glance at the graphic portions of these three reports quickly conveys the 
relative strengths and weaknesses, by quality criterion, of the average proposal 
submitted in 1986 (the lower case "c" profile in each graph). These reports also 
rapidly communicate the range of quality on each criterion from the best proposal to 
the weakest proposal (the lower case "x" profiles). 

In passing, it is worth noting that the greatest weakness on average in the 1986 
proposals was the credibility of findings and conclusions. On average, the greatest 
strength of the 1986 proposals was appropriateness to Division I. Do these results 
imply that those proposals highly appropriate to Division I lack credibility in their 
findings and conclusions? This question emphasizes some of the ambiguities and 
uncertainties in the intended meaning of the quality criteria. 






Conclusions 

The results relating disposition of proposals to reviewer rated acceptability, 
when combined with the obtained reliabilities for observed mean acceptability ratings, 
clearly indicate that in the three years studied, the Program Committee chair and 
members, reviewers, and the general Division I review process did an excellent job in 
selecting high quality programs for the Division. There is a clear distinction in 
the quality of papers, on average, accepted and rejected, and there are reasonable 
policy explanations for why some relatively high rated proposals could be rejected 
and/or moderately low rated proposals might be accepted. The Division I review 
process was shown to be both more valid and more reliable than that reported in an 
analysis of manuscripts submitted to a high quality peer review journal concerned 
with a domain of research problems in many ways similar to that of interest to 
Division I. 

Cason and Cason's simplified model of performance rating fit each set of review 
data. Empirical support was found in all years for both major model constructs: 
stringency (rater standards) and ability (proposal quality). Even in that year 
(1983) where no significant stringency effect was directly observed, the assignment 
of part of the variance to stringency could not be discounted as capitalizing on 
chance. This follows from the significant correlation of stringencies estimated for 
raters participating in the review process in both 1983 and 1985. On average, the 
stringency of both committee and non-committee reviewers may be interpreted as drawn 
from the same hypothetical population of potential reviewers. It is reasonable to 
expect that Cason and Cason's model would fit future program review data unless the 
rater pool changed in some substantial way. 

Application of the model permitted partitioning of the variance so that a more 
valid and reliable measure of proposal acceptability than that represented by the 
mean observed acceptability rating was extracted from the data. Even though the 
increase in validity and reliability was modest, the adjusted ratings were 
nevertheless more valid and reliable. 
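The idea of partitioning out rater standards can be illustrated with a deliberately simplified stand-in. The sketch below is not Cason and Cason's model, which is based on rater characteristic curves; it is a one-step additive adjustment that estimates each rater's leniency as the average amount by which that rater's ratings exceed the proposal means, and then removes it. All names and the toy ratings are hypothetical.

```python
def adjust_for_leniency(ratings):
    """ratings: list of (proposal, rater, rating) tuples.

    Returns (adjusted proposal means, estimated rater leniency), using a
    simple additive one-step estimate rather than the Cason and Cason model.
    """
    by_proposal = {}
    for proposal, _, rating in ratings:
        by_proposal.setdefault(proposal, []).append(rating)
    proposal_mean = {p: sum(v) / len(v) for p, v in by_proposal.items()}

    # Leniency: mean deviation of a rater's ratings from the proposal means.
    deviations = {}
    for proposal, rater, rating in ratings:
        deviations.setdefault(rater, []).append(rating - proposal_mean[proposal])
    leniency = {r: sum(v) / len(v) for r, v in deviations.items()}

    adjusted = {}
    for proposal, rater, rating in ratings:
        adjusted.setdefault(proposal, []).append(rating - leniency[rater])
    return {p: sum(v) / len(v) for p, v in adjusted.items()}, leniency

# Hypothetical data: P1 and P2 are of equal quality, but P1 drew lenient
# rater A while P2 drew stringent rater C.
toy = [("P1", "A", 70), ("P1", "B", 60),
       ("P2", "B", 60), ("P2", "C", 50),
       ("P3", "A", 50), ("P3", "C", 30)]
adjusted_means, leniency = adjust_for_leniency(toy)
print(adjusted_means, leniency)
```

In this toy case the observed means of P1 and P2 differ by 10 points purely because of who rated them, and the adjustment shrinks that gap to 2.5 points. A single additive pass removes only part of the rater effect, which is one reason a fitted model is preferable.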

It might appear easy to dismiss these results as trivial even though 
statistically significant because reliability was improved in 1986, for example, only 
from .74 to .79 and validity improved from .55 to .62. However, this improvement 
could result in up to a 17% change in the composition of the program. Even if the 
acceptance or rejection of only a single proposal were affected, the adjusted ratings 
provide the preferred criteria. 

It might also be tempting to dismiss or undervalue the results because of the 
presumptively multi-dimensional nature of both the measurement of the quality of the 
proposals and the decision to include or exclude papers proposed for the program. 
Supporters of this view would likely take encouragement from the relatively large 
unexplained variance in each year's data; the Cason and Cason model accounted for 
only 56% of the variance in the 1986 data, leaving 44% as unexplained error. 
However, the Cason and Cason model can be applied in a multi-dimensional manner. If 
independent (i.e., orthogonal) evaluative criteria can be identified, Cason and 
Cason's model can be applied separately to each factor and achieve a reduction in the 
error term for variation in rater standards within each factor. Thus, while a 
multi-dimensional analysis might account for a greater proportion of variance than 
did the uni-dimensional analysis reported here, it is likely that partialling out 
rater standards within each factor of that analysis would produce a further 
incremental reduction in error variance for each factor. This effect has already 
been demonstrated in ratings of medical students' performance in a clinical practice 
setting (C. Cason, G. Cason, & Littlefield, 1983). 






These results suggest that there are several areas in which changes might 
provide improvements in the validity and reliability of the review process. The 
results clearly show that there is significant variation in rater standards which 
affects the validity and reliability of the review process. The task of the Program 
Committee would be made less difficult were they provided adjusted acceptability 
ratings at the time the decision to include or exclude paper proposals is made. The 
Program Committee would also likely find it useful were they provided a summary 
quantitative report integrating the ratings provided by each reviewer of each 
separate proposal (e.g., similar to those illustrated in Figure 7). In addition, 
there may be improvements that are possible with respect to the separate quality 
items or their definitions to be included in the proposal review inventory. 
According to Marsh and Ball, improvements in review inventory content have indeed been 
accomplished by Gottfredson (1978). Furthermore, an analysis of the validity and 
reliability of the review and acceptance process should be a routine part of that 
process. This would provide the Program Committee a means for monitoring both the 
current quality of the process and movement toward the goal of improved reliability 
and validity of the process. Implementation of these suggested changes requires the 
availability and application of machine and/or computer based automation technologies 
for the collection, analysis, and reporting of rating data. 

Importance 

Progress in professions education research depends, in part, upon an efficient, 
effective, and believable process for selecting the best papers and articles for 
inclusion in professional meetings and journals. The approach presented here permits 
assessment of the current state of, and progress toward improving, the peer review 
and program committee processes in Division I and has potential for use in other 
similar settings. Certainly, the results support a greater level of confidence in the 
selection process than some might have otherwise believed. Yet, there is clearly 
room and need for further improvement. The methods presented and suggestions made 
can provide part of the basis for making such improvements. The results reported 
above emphasize our need to take heed of Marsh and Ball's wry observation: "It seems 
ironic that scientific method has scarcely been used to determine how best to 
evaluate the products of scientific research" (p. 880). 

References 

Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the 
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105. 

Cason, C.L., Cason, G.J., & Littlefield, J.H. (1983). Variation of intra-rater 
stringency in cognitive-technical and affective-interpersonal clinical performance 
domains. Presented at AERA, Montreal. 

Cason, G.J., & Cason, C.L. (1985). A regression solution to Cason and Cason's model 
of clinical performance rating: Easier, cheaper, faster. Presented at AERA, 
Chicago. 

Cason, G.J., & Cason, C.L. (1984). A deterministic theory of clinical performance 
rating: Promising early results. Evaluation & the Health Professions, 7(2). 

Cason, G.J., Cason, C.L., & Littlefield, J.H. (1983). Controlling rater stringency 
error in clinical performance rating: Further validation of a performance rating 
theory. Presented at AERA, Montreal. Resources in Education, 18(8), 176. (ERIC 
ED-228-314) 

Ebel, R.L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 
407-424. 

Gottfredson, S.D. (1978). Evaluating psychological research reports: Dimensions, 
reliability and correlates of quality judgements. American Psychologist, 33, 
915-929. 

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley. 

Hays, W.L. (1963). Statistics. New York: Holt, Rinehart & Winston. 

Littlefield, J.H., Ellis, R., Cohen, P., & Herbert, R. (1984). Leniency and score 
distribution differences among clinical raters. In Research in Medical Education 
1984: Proceedings of the 23rd Annual Conference. Washington, D.C.: Association of 
American Medical Colleges, 199-204. 

Littlefield, J.H., & Troendle, G.R. (1986). Rating format effects on rater 
agreement and reliability. Presented at AERA, San Francisco. 

Marsh, H.W., & Ball, S. (1981). Interjudgmental reliability of reviews for the 
Journal of Educational Psychology. Journal of Educational Psychology, 73(6), 
872-880. 

McNemar, Q. (1969). Psychological Statistics (4th ed.). New York: Wiley. 

Stanley, J.C. (1961). Analysis of unreplicated three-way classifications with 
applications to rater bias and trait independence. Psychometrika, 26(2), 203-219. 

Ward, J., & Jennings, E. (1973). Introduction to linear models. Englewood Cliffs, 
NJ: Prentice-Hall. 






[Plot: rater characteristic curve (RCC). Vertical axis: rating (percent), with the 
effective rating ceiling and effective rating floor marked. Horizontal axis: subject 
ability and rater stringency, low to high, with points L and S marked.] 

Figure 1. Example rater characteristic curve (RCC): a rater of stringency L gives 
rating R to a subject of ability S. 



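[Plot: two rater characteristic curves. Vertical axis: rating, with ratings RA and RB 
marked. Horizontal axis: subject ability and rater stringency, low to high, with 
points K, S, and L marked.] 

Figure 2. Raters A and B of stringencies K and L give a subject of ability S ratings 
RA and RB respectively. 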






[Plot: expected rating (vertical axis, up to 100) against distance from rater point 
to subject point (horizontal axis).] 

Figure 3. Expected Rating as a Function of Distance Between Rater Reference Point 
and Subject Ability Point 






[Figure 4: Division I reviewer inventory, reproduced from the printed form.] 

DIVISION I -- AERA 
1985 PAPER PROPOSAL EVALUATION FORM 

Paper ID#: P-___          Paper title: 

Reviewer: 

Recommendation: 

1. Definitely accept. 
2. Acceptable; suggest minor changes. 
3. Accept only if space permits. Weaknesses noted in comments. 
4. Reject. 

If rated 1 or 2, list suggestions for discussants below. 
Name:          Phone:          Affiliation: 
Name:          Phone:          Affiliation: 

Below this line will be sent to authors. 
ID#: P-___          Paper title: 

Each criterion is rated by marking one of five points (+) between the two anchors 
shown, or "Not Applicable". 

Criterion                                  Low anchor                    High anchor 

Clarity of Summary                         Obscure (incomplete)          Clear (all elements treated) 
Relevance of Problem 
  (to education, to society)               Insignificant                 Important to field 
Theoretical Framework                      Non-existent                  Well grounded 
Methodology or Mode of Inquiry             Insufficiently developed      Highly Appropriate 
Execution of Study (coherence, clarity)    Unsystematic                  Carefully done 
Findings & Conclusions                     Lacks credibility 
                                             or overstated               Well founded 
Appropriateness to Division I              Inappropriate 
                                             (Where should it be ...)    Highly Appropriate 

EXPLAIN REASON FOR RANKS (use reverse side of paper if necessary) 



[Figure 5: facsimile of a completed UAMS OTS-PR machine scannable rating sheet. The 
sheet carries identification grids (subject name, subject I.D. number, subject type 
or rank), marking instructions (soft black pencil, one mark per item), and a grid of 
five response bubbles per item. Legible entries: Name: AERA DIVISION I PROGRAM: 1986. 
Chief instructor, coordinator, or director: Carole J. Bland, PhD. Form used to rate 
paper proposals (1986 proposals). 

Rating scale: 5 = Outstanding, 4 = Very good, 3 = Good, 2 = Poor, 1 = Very poor. 

Items rated: Clarity; Problem relevance/importance; Theoretical framework; 
Methodology/mode of inquiry; Execution of study; Credibility of findings/conclusions; 
Appropriateness to Division I.] 



RATING ANALYSIS SUMMARY (Current Rating) 

Test: 1 - 1986 PROPOSAL EVALUATIONS 
Instructor: Carole J. Bland, PhD 
Course: AERA DIVISION I PROGRAM: 1986 
Prepared by the UAMS OTS/PR System (version A1) as implemented at UAMS 
Dept: Educational Development 
Subjects rated: 115     Absent: 2     Withdrawn: 1 

                                   Average   1 Rater   Mean Rater    Std Dev   Std Error of 
Item                               (5 Pt)    Reliab.   Reliability   (5 Pt)    Measure (5 Pt) 

1 Clarity                           3.52       .24        0.55        .755        0.51 
2 Problem relevance/importance      3.80       .18        0.46        .645        0.47 
3 Theoretical framework             3.20       .23        0.52        .743        0.51 
4 Methodology/mode of inquiry       3.15       .24        0.55        .845        0.57 
5 Execution of study                3.32       .23        0.50        .822        0.58 
6 Credibility of findings/ 
    conclusions                     3.01       .19        0.43        .889        0.67 
7 Appropriateness to Division I     3.97       .17        0.43        .681        0.51 

Overall (mean # raters 3.9)         3.43       .25        0.56        .575        0.38 
Overall raw score: average 24.0, standard deviation 4.0, standard error of measure 2.65 

Note (3), as far as legible: the predicted score range around an original score is 
plus and minus 1.15 times SEM at the 75% level of confidence, 1.96 times SEM at 95%, 
and 2.58 times SEM at 99%. 

Figure 6. Rating analysis summary 



[Printout: UAMS OTS/PR individual performance report (current rating), prepared 
10-Jun-86, for proposal ("student") 585000441; rating/test 1 - 1986 PROPOSAL 
EVALUATIONS; course AERA DIVISION I PROGRAM: 1986. The report plots the proposal's 
mean (x) and the class mean (c) for each of the seven items on the 5-point scale. 
Class overall mean rating 3.43 (StdDev .575); this proposal's overall mean rating 
4.54 (SEM .38). Item means for this proposal, items 1 to 7: 4.50, 4.75, 4.25, 5.00, 
4.50, 4.00, 4.75. Overall raw score 31.75 out of a perfect 35 (90.7%), Z = 693, 
rank 1 out of 115; class average raw score 23.99.] 

Figure 7A. Individual performance report for best proposal 



[Printout: UAMS OTS/PR individual performance report (current rating), prepared 
10-Jun-86, for proposal ("student") 585000300; rating/test 1 - 1986 PROPOSAL 
EVALUATIONS; course AERA DIVISION I PROGRAM: 1986. The report plots the proposal's 
mean (x) and the class mean (c) for each of the seven items on the 5-point scale. 
Class overall mean rating 3.43 (StdDev .575); this proposal's overall mean rating 
3.43 (SEM .38). Overall raw score 24.04 out of a perfect 35 (68.7%), Z = 501, 
rank 59 out of 115; class average raw score 23.99.] 

Figure 7B. Individual performance report for proposal of average quality 






INDIVIDUAL PERFORMANCE REPORT (Current Rating)

To Student: 585001027
From: CAROLE J. BUND, PHD
Re: Rating/test 1-1986 PROPOSAL EVALUATIONS
Dept: Educational Development
Course: AERA DIVISION I PROGRAM: 1986
Prepared 10-Jun-86 00:10 by the UAMS OTS/PR System (version Al) as implemented at UAMS

Class Overall Mean Rating = 3.43
Your Overall Mean Rating  = 1.87

                                    5 Pt Score                             Raw Score
Item                                Your     Your    Class    Class     Perfect   Yours
                                    Mean=x   SEM     Mean=c   StdDev    p         xp/5
1 CLARITY                           2.00     .54     3.52     .755      5         2.00
2 PROBLEM RELEVANCE/IMPORTANCE      2.00     .47     3.80     .645      5         2.00
3 THEORETICAL FRAMEWORK             1.75     .50     3.20     .743      5         1.75
4 [label illegible]                 1.75     .56     3.15     .845      5         1.75
5 [label illegible]                 2.00     .65     3.32     .822      5         2.00
6 [label illegible]                 1.00     .74     3.01     .889      5         1.00
7 [label illegible]                 2.00     .51     3.97     .681      5         2.00

[The "# of Raters" column and the profile plot of c and x on the 5 point scale are not legible.]

Rating Scale
5 = OUTSTANDING
4 = VERY GOOD
3 = GOOD
2 = POOR
1 = VERY POOR

Definition of Symbols
c = class mean 5 pt score on item (or category)
x = your mean 5 pt score on item (or category)
SEM = Standard Error of Measurement

class overall 5 pt score: 3.43   StdDev: .575
your overall 5 pt score:  1.87   SEM:    .38

Your overall raw score 13.09 (out of perfect 35) yields: 37.4%  Z= 229  Rank= 115 (out of 115). Class ave raw score [value not legible]



Figure 7C. Individual performance report for weakest proposal 
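
The summary line printed at the foot of each report in Figures 7A to 7C can be checked by hand. The sketch below is only an illustration and is not part of the UAMS system: the function names are hypothetical, and the Z formula (a standard score rescaled to mean 500 and standard deviation 100) is an assumption inferred from the printed values, since the reports do not state it. The percentage is taken to be the overall raw score divided by the perfect score of 35.

```python
# Hypothetical helpers for checking the summary line of each report.
# ASSUMPTION: Z = 500 + 100 * (overall - class mean) / class StdDev.
# This formula is inferred from the printed values, not stated in the reports.

CLASS_MEAN = 3.43    # class overall 5 pt score (printed on each report)
CLASS_SD = 0.575     # class overall StdDev (printed on each report)
PERFECT_RAW = 35.0   # 7 items x 5 points


def percent_of_perfect(raw_score: float) -> float:
    """Overall raw score as a percentage of the perfect raw score."""
    return round(100.0 * raw_score / PERFECT_RAW, 1)


def scaled_z(overall_score: float) -> int:
    """Standard score rescaled to mean 500, SD 100 (assumed convention)."""
    return round(500 + 100 * (overall_score - CLASS_MEAN) / CLASS_SD)


# (overall raw score, overall 5 pt score) as printed in each figure
reports = {
    "7A": (31.75, 4.54),
    "7B": (24.04, 3.43),
    "7C": (13.09, 1.87),
}

for figure, (raw, overall) in reports.items():
    print(figure, percent_of_perfect(raw), scaled_z(overall))
```

Under these assumptions the percentages reproduce the printed 90.7%, 68.7%, and 37.4%, and the Z values reproduce the printed 693 and 229. For Figure 7B the sketch gives 500 where the report prints 501; a one-point gap of that size would be expected if the system worked from unrounded means, but that is a guess.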






